1 Mixed Effects Models

The motivation for using mixed models stems from the need to account for nuisance variance in the population we sampled as part of our experiment. One of the main sources of this nuisance variance is non-independence, which showed up in the split-plot and repeated measures designs.

1.1 Terminology

Mixed models are known by a variety of names that are usually sub-field specific. The most common terms you may come across are:

* Variance components
* Random intercepts and slopes
* Random effects
* Varying coefficients
* Intercepts- and/or slopes-as-outcomes
* Hierarchical linear models
* Multilevel models (implies multiple levels of hierarchically clustered data)
* Growth curve models (possibly Latent GCM)
* Mixed effects models

We will be using mixed effects models, which generally refers to the inclusion of both fixed and random effects. This terminology does not imply a specific structure (e.g., hierarchical or multilevel). The term fixed effects refers to the non-random terms included in the model, which include the variables of experimental interest. Random effects are often grouping factors that help define the clustering of observational units.

1.2 Random Intercept Model

The simplest and most common case of a mixed effects model is where there is a single grouping or cluster structure for the random effect. This is known as a random intercept model.
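In generic notation (symbols chosen here for illustration, not taken from the text), a random intercept model for treatment $i$ in group $j$ can be written as:

```latex
y_{ij} = \mu + \tau_i + b_j + \varepsilon_{ij}, \qquad
b_j \sim N(0, \sigma^2_b), \qquad
\varepsilon_{ij} \sim N(0, \sigma^2)
```

where $\tau_i$ is the fixed treatment effect, $b_j$ is the random intercept for group $j$, and $\sigma^2_b$ quantifies how much the group intercepts vary around the overall mean $\mu$.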

1.2.1 Review the Experimental Design

The first example comes from the InsectSprays data set, where multiple pesticide sprays are used and the count of insects is observed.

1.2.2 Import the Data

#Loading the Data
data <- read_csv(here("data","Random_Intercept_Model.csv"))
## Rows: 48 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): spray
## dbl (1): count
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## spc_tbl_ [48 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ count: num [1:48] 10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: chr [1:48] "A" "A" "A" "A" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   count = col_double(),
##   ..   spray = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

1.2.3 Exploring the Data

ggplot(data, aes(x = spray, y = count)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(height = 0, width = 0.1)

The first plot explores whether there is a relationship between the different sprays used and the count of insects observed. We can add blocks to the data by constructing a vector of factor variables. The code below builds 12 different blocks, with each block comprised of 4 replicates, for a total of 48 replicates or samples. In the block-faceted plot, each boxplot collapses to a point and line because each block contains only a single observation per spray.

data$block <- as.factor(rep(c(1:12), 4))
glimpse(data)
## Rows: 48
## Columns: 3
## $ count <dbl> 10, 7, 20, 14, 14, 12, 10, 23, 17, 20, 14, 13, 11, 17, 21, 11, 1…
## $ spray <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B",…
## $ block <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9…
ggplot(data, aes(x = spray, y = count)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(height = 0,width = 0.1) +
  facet_wrap(~ block)  # 12 blocks

1.2.4 Fitting ANOVA Models with R

The initial model considers the variation in count among the different sprays and ignores the blocking design.

# Running the Model ANOVA - ignoring the block design
Mod <- glm(count ~ spray, data = data)
car::Anova(Mod, test.statistic = "F")
            Sum Sq Df  F value Pr(>F)
spray     1648.729  3 26.47836      0
Residuals  913.250 44       NA     NA

We can account for the blocking factor by adding the block variable as a fixed effect.

Mod2 <- glm(count ~ spray + block, data = data)
car::Anova(Mod2, test.statistic = "F")
             Sum Sq Df   F value    Pr(>F)
spray     1648.7292  3 32.529764 0.0000000
block      355.7292 11  1.914166 0.0737824
Residuals  557.5208 33        NA        NA

Often we aren’t interested in testing the significance of, or estimating parameters for, the block effect - we just want to account for it. Therefore, block may be more appropriately fitted as a random effect. This is called a linear mixed effects model with block as a random effect (random intercept). The random intercept is specified using the (1|block) notation, which is the same for both the glmmTMB and lmer functions.

Mod3 <- lmer(count ~ spray + (1|block), data = data)
# Mod3 <- glmmTMB(count ~ spray + (1|block), data = data)
# anova(Mod3) - no single-model anova() table for glmmTMB fits; use car::Anova() instead
car::Anova(Mod3, test.statistic = "F")
             F Df Df.res Pr(>F)
spray 32.52976  3     33      0
summary(Mod3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: count ~ spray + (1 | block)
##    Data: data
## 
## REML criterion at convergence: 266.3
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.95239 -0.58269 -0.07399  0.60466  1.99778 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  block    (Intercept)  3.861   1.965   
##  Residual             16.895   4.110   
## Number of obs: 48, groups:  block, 12
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  14.5000     1.3152  11.025
## sprayB        0.8333     1.6780   0.497
## sprayC      -12.4167     1.6780  -7.400
## sprayF        2.1667     1.6780   1.291
## 
## Correlation of Fixed Effects:
##        (Intr) sprayB sprayC
## sprayB -0.638              
## sprayC -0.638  0.500       
## sprayF -0.638  0.500  0.500

1.2.5 Assumptions

Now we can look at the diagnostics from both the fixed effects and mixed effects models.

plot(resid(Mod2) ~ fitted(Mod2))
abline(h = 0)

plot(resid(Mod3) ~ fitted(Mod3))
abline(h = 0)

simulateResiduals(Mod2, plot = T)

## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
##  
## Scaled residual values: 0.484 0.148 0.64 0.476 0.488 0.324 0.276 0.956 0.332 0.636 0.512 0.484 0.496 0.916 0.62 0.196 0.552 0.488 0.844 0.584 ...
simulateResiduals(Mod3, plot = T)

## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
##  
## Scaled residual values: 0.16 0.064 0.896 0.46 0.388 0.296 0.136 0.96 0.736 0.896 0.456 0.384 0.192 0.648 0.876 0.216 0.548 0.412 0.612 0.652 ...
qqnorm(resid(Mod2))
qqline(resid(Mod2))

lattice::qqmath(Mod3) # doesn't work with glm or lm models

1.2.6 Mean Comparisons

Finally we can compare the estimated marginal means.

emmeans2 <- Mod2 %>%
  emmeans(specs = "spray", type = "response") %>%
  cld(Letters = letters)
emmeans2
  spray    emmean       SE df   lower.CL upper.CL .group
3     C  2.083333 1.186542 33 -0.3307036  4.49737 a
1     A 14.500000 1.186542 33 12.0859630 16.91404  b
2     B 15.333333 1.186542 33 12.9192964 17.74737  b
4     F 16.666667 1.186542 33 14.2526297 19.08070  b
emmeans3 <- Mod3 %>%
  emmeans(specs = "spray", type = "response") %>%
  cld(Letters = letters)
emmeans3
  spray    emmean       SE       df   lower.CL  upper.CL .group
3     C  2.083333 1.315158 39.86165 -0.5749872  4.741654 a
1     A 14.500000 1.315158 39.86165 11.8416794 17.158321  b
2     B 15.333333 1.315158 39.86165 12.6750128 17.991654  b
4     F 16.666667 1.315158 39.86165 14.0083461 19.324987  b

1.2.7 Data Visualization

emmeans2 <- emmeans2 %>%
  as_tibble()
emmeans3 <- emmeans3 %>%
  as_tibble()

Plot_Mod2 <- ggplot() +
  geom_point(data = data, aes(y = count, x = spray), position = position_jitter(width = 0.1)) + # dots representing the raw data
  geom_boxplot(data = data, aes(y = count, x = spray), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
  geom_point(data = emmeans2, aes(y = emmean, x = spray), position = position_nudge(x = 0.15), size = 2,color = "red") + # red dots representing the adjusted means
  geom_errorbar(data = emmeans2, aes(ymin = lower.CL, ymax = upper.CL, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
  geom_text(data = emmeans2, aes(y = emmean, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # letters indicating mean groupings
  labs(y = "Counts", x = "Treatment") +
  theme_classic()
Plot_Mod2

Plot_Mod3 <- ggplot() +
  geom_point(data = data, aes(y = count, x = spray), position = position_jitter(width = 0.1)) + # dots representing the raw data
  geom_boxplot(data = data, aes(y = count, x = spray), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
  geom_point(data = emmeans3, aes(y = emmean, x = spray), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
  geom_errorbar(data = emmeans3, aes(ymin = lower.CL, ymax = upper.CL, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
  geom_text(data = emmeans3, aes(y = emmean, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # letters indicating mean groupings
  labs(y = "Counts", x = "Treatment") +
  theme_classic()
Plot_Mod3

The fixed effects and random effects models produce similar estimates (with wider confidence intervals for the random effects model). However, by treating block as a random effect, we can conceptualize these blocks as a sub-sample of the larger population of potential blocks. This allows us to apply the conclusions from this model to other blocks from that larger population.

This question, “should I treat factor XXX as fixed or random?” can be a very complicated question with competing opinions and definitions. Some examples from relevant literature are given for context. You should think critically and find evidence to support your decisions.

Fixed effects are unknown constants that we wish to estimate from the model and that could be estimated again in subsequent experimentation. The researcher is interested in these particular estimates.

Random effects are random variables sampled from a population, whose particular levels would not be observed in subsequent experimentation. The researcher is not interested in these particular levels, but rather in how the estimates vary from sample to sample.

From Gelman 2005:

1) Fixed effects are constant across individuals, and random effects vary (e.g., random intercepts with fixed slopes produce parallel lines with varying intercepts).
2) Effects are fixed if they are interesting in themselves, or random if there is interest in the underlying population.
3) When a sample exhausts the population, the corresponding variable is fixed; when the sample is a small (i.e., negligible) part of the population, the corresponding variable is random.
4) If an effect is assumed to be a realized value of a random variable, it is called a random effect.
5) Fixed effects are estimated using least squares (or, more generally, maximum likelihood) and random effects are estimated with shrinkage (e.g., linear unbiased prediction).
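Gelman's fifth point can be made concrete with a quick calculation. A group's predicted random effect (BLUP) is its raw deviation from the grand mean multiplied by a shrinkage factor determined by the variance components. The sketch below plugs in the block and residual variances reported in the Mod3 summary above (values copied from that output; n = 4 observations per block):

```r
# Shrinkage factor for a block's predicted effect (BLUP),
# using the Mod3 variance components shown above
var_block <- 3.861   # between-block variance
var_resid <- 16.895  # residual variance
n <- 4               # observations per block
lambda <- var_block / (var_block + var_resid / n)
round(lambda, 2)  # ~0.48: each BLUP is about 48% of the block's raw deviation
```

The smaller the block variance relative to the residual noise, the more strongly the block predictions shrink toward the grand mean.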

“…one modeler’s random effect is another modeler’s fixed effect” (Schabenberger and Pierce 2001).

One particularly contested idea involves the minimum number of levels within a random effect (often cited as 5-6). This advice comes from a pragmatic concern with capturing enough of the population's variance, but a more philosophical opinion would focus on the purpose of the term as it is included in the model.

Another major difference for mixed-effects models is that we can calculate the variance component of the random effects: how much variation there is among the intercepts of the levels of the random effect. In the InsectSprays example, if block had very little effect on the insect counts (all blocks were about the same), the variance component would be low (near zero). However, if there was a large amount of variation among the blocks (some blocks had very few insects and some had a lot), the variance component would be high. The concept of variance components is closely related to the coefficient of determination, or R-squared.
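As a rough illustration, the variance components from Mod3 above can be combined into an intraclass correlation (ICC), the proportion of total variance attributable to blocks (values copied from the summary output):

```r
# Intraclass correlation from the Mod3 variance components above
var_block <- 3.861   # block (Intercept) variance
var_resid <- 16.895  # residual variance
icc <- var_block / (var_block + var_resid)
round(icc, 2)  # ~0.19: about 19% of the total variance is among blocks
```

A near-zero ICC would mean the blocks are essentially interchangeable; a high ICC means observations within a block strongly resemble each other.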

The R-squared value is reported directly in the summary of a fixed effects model, but a few more steps are needed to extract it from a mixed effects model or a glmmTMB fit. The fixed effects model gives us the R-squared and adjusted R-squared values. The mixed effects model gives us the conditional R-squared (fixed plus random effects) and the marginal R-squared (fixed effects only). Comparing the two tells us how much variance is attributed to the random effect relative to the whole model.

r2(Mod2)
##   R2: 0.782
r2(Mod3)
## # R2 for Mixed Models
## 
##   Conditional R2: 0.697
##      Marginal R2: 0.628

Mixed effects models become further complicated when we move beyond the simple random intercept model.

1.3 Kinds of Random Effects Structure

Mixed effects models can have one or multiple sources of clustering that require fitting different random effects to best capture the variance attributed to those observational units. The structure of this clustering may be hierarchical or take other complicated forms, which can be a source of confusion for those new to mixed effects models. A classic example would be students' GPAs observed multiple times (repeated observations nested within students) from multiple classrooms (students nested within classrooms) across multiple schools (classrooms nested within schools). Another example may be the yield of a random selection of crop varieties tested at multiple locations.

Another set of terms you may come across when discussing random effects are nested and crossed. These terms describe the data, but recognizing this characteristic can be important when specifying the random effects terms in the model. Nested random effects occur when each member of one group is contained entirely within a single unit of another group - think of the students within classrooms example above. Crossed random effects occur when this nesting does not hold - think of the crop varieties tested at multiple locations, where the same variety can be tested at multiple locations and each location can have multiple varieties.

This mainly comes down to how each level of clustered data is coded. As long as the levels of the nested variable are unique across the whole data set, as opposed to unique only within each level of the nesting variable, the nested and crossed specifications are identical. Review this link for more explanation:
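A small sketch of the coding issue, using hypothetical school/classroom labels (not from any data set used here): if classroom labels repeat across schools, the nesting must be stated explicitly; making the labels globally unique lets the nested and crossed specifications coincide.

```r
# Hypothetical grouping labels: classrooms "c1" and "c2" exist in both schools
d <- expand.grid(school = c("S1", "S2"), classroom = c("c1", "c2"))

# With repeated labels, (1|school/classroom) is needed; it expands to
# (1|school) + (1|school:classroom).
# With globally unique labels, (1|school) + (1|classroom_unique)
# fits the same nested structure.
d$classroom_unique <- factor(paste(d$school, d$classroom, sep = "_"))
nlevels(factor(d$classroom))  # 2 shared labels
nlevels(d$classroom_unique)   # 4 distinct classrooms
```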

1.4 Multiple Random Intercepts

1.4.1 Review the Experimental Design

You may recognize this model structure from the examples used in the split-plot design lecture.

The data for this example are a slightly modified version of a yield (kg/ha) trial laid out as a split-plot design (Gomez & Gomez 1984). The trial had 4 genotypes (G) and 6 nitrogen levels (N or n_amount), with 3 complete replicates (rep) and 6 incomplete blocks (mainplot) within each replicate.
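In one common formulation (generic symbols chosen here for illustration, not notation from Gomez & Gomez), the model fitted below can be written as:

```latex
y_{ijk} = \mu + \nu_i + \gamma_j + (\nu\gamma)_{ij} + r_k + m_{ik} + \varepsilon_{ijk},
\qquad r_k \sim N(0, \sigma^2_r), \quad m_{ik} \sim N(0, \sigma^2_m), \quad \varepsilon_{ijk} \sim N(0, \sigma^2)
```

where $\nu_i$ is the nitrogen level, $\gamma_j$ is the genotype, $r_k$ is the random replicate effect, and $m_{ik}$ is the random mainplot-within-replicate effect.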

1.4.2 Importing the Data

dat <- read_csv(here("data","Multiple_Random_Intercept_Model.csv"))
## Rows: 72 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): rep, mainplot, G, N
## dbl (4): yield, row, col, n_amount
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert Variables to Factors
dat <- dat %>%
  mutate_at(vars(rep, mainplot, N, G), as.factor)

1.4.3 Exploring the Data

desplot(data = dat,
        form = rep ~ col + row|rep, # fill color per rep, headers per rep
        text = G, cex = 1, shorten = "no", # show genotype names per plot
        col = N, # color of genotype names for each N-level
        out1 = mainplot, out1.gpar = list(col = "black"), # lines between mainplots
        out2 = row, out2.gpar = list(col = "darkgrey"), # lines between rows
                main = "Field Layout", show.key = T, key.cex = 0.7) # formatting

1.4.4 Fitting ANOVA Models with R and Assumptions

mod_re <- lmerTest::lmer(yield ~ N * G + (1|rep) + (1|rep:mainplot), data = dat) # (1|rep/mainplot) same syntax for lmer function - glmmTMB may produce different result
## boundary (singular) fit: see help('isSingular')
# isSingular warning message tells us that one of the random effects is accounting for very little variation - since this random effect is related to the experimental design we will NOT ignore it
summary(mod_re)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: yield ~ N * G + (1 | rep) + (1 | rep:mainplot)
##    Data: dat
## 
## REML criterion at convergence: 780.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8506 -0.5356 -0.1311  0.4778  1.9272 
## 
## Random effects:
##  Groups       Name        Variance  Std.Dev. 
##  rep:mainplot (Intercept) 5.086e+04 225.51172
##  rep          (Intercept) 3.070e-03   0.05541
##  Residual                 3.494e+05 591.13579
## Number of obs: 72, groups:  rep:mainplot, 18; rep, 3
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  4252.67     365.28    45.78  11.642 2.79e-15 ***
## NN2          1419.33     516.59    45.78   2.748  0.00856 ** 
## NN3          2147.33     516.59    45.78   4.157  0.00014 ***
## NN4          2480.00     516.59    45.78   4.801 1.73e-05 ***
## NN5          3310.67     516.59    45.78   6.409 7.18e-08 ***
## NN6          4448.00     516.59    45.78   8.610 3.94e-11 ***
## GB             53.33     482.66    36.00   0.110  0.91263    
## GC          -1075.33     482.66    36.00  -2.228  0.03222 *  
## GD            228.67     482.66    36.00   0.474  0.63853    
## NN2:GB        256.67     682.58    36.00   0.376  0.70911    
## NN3:GB       -194.33     682.58    36.00  -0.285  0.77750    
## NN4:GB        109.00     682.58    36.00   0.160  0.87402    
## NN5:GB       -666.00     682.58    36.00  -0.976  0.33572    
## NN6:GB      -2213.67     682.58    36.00  -3.243  0.00255 ** 
## NN2:GC        846.00     682.58    36.00   1.239  0.22321    
## NN3:GC        669.33     682.58    36.00   0.981  0.33334    
## NN4:GC        356.67     682.58    36.00   0.523  0.60451    
## NN5:GC        199.33     682.58    36.00   0.292  0.77194    
## NN6:GC      -1560.00     682.58    36.00  -2.285  0.02828 *  
## NN2:GD      -1084.67     682.58    36.00  -1.589  0.12079    
## NN3:GD      -1816.67     682.58    36.00  -2.661  0.01155 *  
## NN4:GD      -3145.33     682.58    36.00  -4.608 4.95e-05 ***
## NN5:GD      -5745.33     682.58    36.00  -8.417 5.01e-10 ***
## NN6:GD      -7048.67     682.58    36.00 -10.326 2.62e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation matrix not shown by default, as p = 24 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
plot(resid(mod_re) ~ fitted(mod_re))
abline(h = 0)

# Review contribution of each random effect and their predictors (BLUPs)
summary(mod_re)$varcor
##  Groups       Name        Std.Dev.  
##  rep:mainplot (Intercept) 225.511719
##  rep          (Intercept)   0.055412
##  Residual                 591.135787
ranef(mod_re)$rep
     (Intercept)
rep1   -1.45e-05
rep2    2.29e-05
rep3   -8.30e-06
ranef(mod_re)$`rep:mainplot`
          (Intercept)
rep1:mp1     9.811810
rep1:mp2    63.224824
rep1:mp3  -218.067361
rep1:mp4   -28.055625
rep1:mp5  -153.708053
rep1:mp6    86.282566
rep2:mp10   -7.297538
rep2:mp11  207.335697
rep2:mp12  170.756061
rep2:mp7   116.821795
rep2:mp8     7.297522
rep2:mp9  -116.269898
rep3:mp13   47.311300
rep3:mp14 -179.280072
rep3:mp15   36.886258
rep3:mp16   -2.514272
rep3:mp17  -70.522346
rep3:mp18   29.987332

1.4.5 Mean Comparisons and Data Visualization

Here we have multiple random effects - replicate and mainplot within replicate.

mod_re %>% car::Anova(type = 3,test.statistic = "F")
                     F Df   Df.res    Pr(>F)
(Intercept) 135.538147  1 45.78314 0.0000000
N            17.621854  5 41.96547 0.0000000
G             3.016614  3 36.00000 0.0424007
N:G          13.235986 15 36.00000 0.0000000
withinG_mean_comparisons_tukey_re <- mod_re %>%
  emmeans(specs = ~ N|G) %>%
  cld(Letters = letters)
withinG_mean_comparisons_tukey_re
    N G   emmean       SE       df lower.CL upper.CL .group
 1 N1 A 4252.667 365.2839 45.78314 3517.294 4988.039 a
 2 N2 A 5672.000 365.2839 45.78314 4936.628 6407.372 ab
 3 N3 A 6400.000 365.2839 45.78314 5664.628 7135.372  bc
 4 N4 A 6732.667 365.2839 45.78314 5997.294 7468.039  bc
 5 N5 A 7563.333 365.2839 45.78314 6827.961 8298.706   cd
 6 N6 A 8700.667 365.2839 45.78314 7965.294 9436.039    d
 7 N1 B 4306.000 365.2839 45.78314 3570.628 5041.372 a
 8 N2 B 5982.000 365.2839 45.78314 5246.628 6717.372  b
 9 N3 B 6259.000 365.2839 45.78314 5523.628 6994.372  b
12 N6 B 6540.333 365.2839 45.78314 5804.961 7275.706  b
10 N4 B 6895.000 365.2839 45.78314 6159.628 7630.372  b
11 N5 B 6950.667 365.2839 45.78314 6215.294 7686.039  b
13 N1 C 3177.333 365.2839 45.78314 2441.961 3912.706 a
14 N2 C 5442.667 365.2839 45.78314 4707.294 6178.039  b
15 N3 C 5994.000 365.2839 45.78314 5258.628 6729.372  b
16 N4 C 6014.000 365.2839 45.78314 5278.628 6749.372  b
18 N6 C 6065.333 365.2839 45.78314 5329.961 6800.706  b
17 N5 C 6687.333 365.2839 45.78314 5951.961 7422.706  b
24 N6 D 1880.667 365.2839 45.78314 1145.294 2616.039 a
23 N5 D 2046.667 365.2839 45.78314 1311.294 2782.039 a
22 N4 D 3816.000 365.2839 45.78314 3080.628 4551.372  b
19 N1 D 4481.333 365.2839 45.78314 3745.961 5216.706  b
21 N3 D 4812.000 365.2839 45.78314 4076.628 5547.372  b
20 N2 D 4816.000 365.2839 45.78314 4080.628 5551.372  b
withinG_mean_comparisons_tukey_re <- withinG_mean_comparisons_tukey_re %>%
  as_tibble() %>%
  mutate(N_G = paste0(N,"-",G)) %>%
  mutate(N_G = fct_reorder(N_G, emmean))

dat <- dat %>%
  mutate(N_G = paste0(N,"-",G)) %>%
  mutate(N_G = fct_relevel(N_G, levels(withinG_mean_comparisons_tukey_re$N_G)))

withinG_RCBD_Plot_tukey <- ggplot() +
  facet_wrap(~ G, labeller = label_both) + # facet per G level
  geom_point(data = dat, aes(y = yield, x = N, color = N)) + # dots representing the raw data
  geom_point(data = withinG_mean_comparisons_tukey_re, aes(y = emmean, x = N), color = "red", position = position_nudge(x = 0.1)) + # red dots representing the adjusted means
  geom_errorbar(data = withinG_mean_comparisons_tukey_re, aes(ymin = lower.CL, ymax = upper.CL, x = N), color = "red", width = 0.1, position = position_nudge(x = 0.1)) + # red error bars representing the confidence limits of the adjusted means
  geom_text(data = withinG_mean_comparisons_tukey_re, aes(y = emmean, x = N, label = str_trim(.group)), color = "red", position = position_nudge(x = 0.2), hjust = 0) + # red letters 
  scale_y_continuous(name = "Yield", limits = c(0, NA), expand = expansion(mult = c(0, 0.1))) +
  scale_x_discrete(name = NULL) +
  theme_classic() + # clearer plot format 
labs(caption = str_wrap("Black dots represent raw data. Red dots and error bars represent estimated marginal means ± 95% confidence interval per group. Means not sharing any letter are significantly different by Tukey-adjusted comparisons at the 5% level of significance.", width = 120)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), legend.position = "bottom")
withinG_RCBD_Plot_tukey

Mixed effects models become even further complicated when you consider random slopes in addition to random intercepts. The model specification details found at this link can help you draft the specific syntax to represent the random effects of your model.

1.5 Random Slope and Intercept

1.5.1 Review the Experimental Design

These data come from the rikz dataset of 45 intertidal sites across 9 beaches in the Netherlands. We first model the change in species richness along the position of the site relative to mean sea level (NAP), allowing the intercept to vary among beaches (a linear mixed effects model with a random intercept). We then consider a model where the effect of NAP on the response also varies from one beach to another (a linear mixed effects model with a random slope and intercept).
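In generic notation (an assumed standard formulation, not taken from the source), the two models being compared are:

```latex
\text{Random intercept:}\quad
\mathrm{Richness}_{ij} = (\beta_0 + b_{0j}) + \beta_1 \,\mathrm{NAP}_{ij} + \varepsilon_{ij}

\text{Random slope and intercept:}\quad
\mathrm{Richness}_{ij} = (\beta_0 + b_{0j}) + (\beta_1 + b_{1j})\,\mathrm{NAP}_{ij} + \varepsilon_{ij},
\qquad (b_{0j}, b_{1j})^\top \sim N(\mathbf{0}, \Sigma)
```

where $j$ indexes beaches and $\Sigma$ allows the random intercepts and slopes to be correlated.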

1.5.2 Importing the Data

data <- read_csv(here("data","Random_Slope_and_Intercept Model_Example.csv"))
## Rows: 45 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): Richness, Exposure, NAP, Beach, Site
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## spc_tbl_ [45 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Richness: num [1:45] 11 10 13 11 10 8 9 8 19 17 ...
##  $ Exposure: num [1:45] 10 10 10 10 10 8 8 8 8 8 ...
##  $ NAP     : num [1:45] 0.045 -1.036 -1.336 0.616 -0.684 ...
##  $ Beach   : num [1:45] 1 1 1 1 1 2 2 2 2 2 ...
##  $ Site    : num [1:45] 1 2 3 4 5 1 2 3 4 5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Richness = col_double(),
##   ..   Exposure = col_double(),
##   ..   NAP = col_double(),
##   ..   Beach = col_double(),
##   ..   Site = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
data <- mutate(data, Beach = as.factor(Beach), 
               Exposure = as.factor(Exposure))

1.5.3 Exploring the Data

ggplot(data, aes(y = Richness, x = NAP)) +
  geom_point() +
  xlab("NAP") +
  ylab("Richness") +
  theme_classic(base_size = 15) +
  stat_smooth(method = "lm", formula = 'y ~ x', se = F, fullrange = T) +
  facet_wrap(~ Beach)

# Does the richness change along the NAP gradient change between beaches?

1.5.4 Fitting ANOVA Models with R

With especially complicated models it is common to have convergence issues. The following link can help troubleshoot issues with lmer type models based on the warning/error messages printed after fitting a model.

mod_rint <- lmer(Richness ~ NAP + (1|Beach), data, REML = T)

mod_rsint <- lmer(Richness ~ NAP + (1 + NAP|Beach), data, REML = T)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00267151 (tol = 0.002, component 1)
# Restarting from previous fit and increased max number of iterations
ss <- getME(mod_rsint, c("theta", "fixef"))
mod_rsint2 <- update(mod_rsint, start = ss, control = lmerControl(optCtrl = list(maxfun = 2e4)), REML = T)
summary(mod_rsint2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Richness ~ NAP + (1 + NAP | Beach)
##    Data: data
## Control: lmerControl(optCtrl = list(maxfun = 20000))
## 
## REML criterion at convergence: 232.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8211 -0.3411 -0.1675  0.1924  3.0400 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  Beach    (Intercept) 12.596   3.549         
##           NAP          2.940   1.715    -0.99
##  Residual              7.307   2.703         
## Number of obs: 45, groups:  Beach, 9
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   6.5885     1.2648   5.209
## NAP          -2.8300     0.7229  -3.915
## 
## Correlation of Fixed Effects:
##     (Intr)
## NAP -0.819
# Compare random effects structure to see which fits the data best
AICcmodavg::aictab(
  cand.set = list(mod_rint, mod_rsint2), 
  modnames = c("Intercept", "Slope+Intercept"),
  second.ord = FALSE) # get AIC instead of AICc
## Warning in aictab.AIClmerMod(cand.set = list(mod_rint, mod_rsint2), modnames = c("Intercept", : 
## Model selection for fixed effects is only appropriate with ML estimation:
## REML (default) should only be used to select random effects for a constant set of fixed effects
         Modnames K      AIC Delta_AIC  ModelLik    AICWt    Res.LL   Cum.Wt
2 Slope+Intercept 6 244.3839  0.000000 1.0000000 0.824652 -116.1919 0.824652
1       Intercept 4 247.4802  3.096378 0.2126327 0.175348 -119.7401 1.000000
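The ModelLik and AICWt columns follow directly from Delta_AIC; a quick base R check (values copied from the aictab output above):

```r
# Akaike weights by hand from the Delta_AIC column above
delta <- c(0, 3.096378)   # Slope+Intercept, Intercept
lik <- exp(-0.5 * delta)  # relative likelihood of each model
w <- lik / sum(lik)       # AIC weights
round(lik, 3)  # 1.000 0.213
round(w, 3)    # 0.825 0.175
```

The random slope model carries about 82% of the weight, supporting the more complex random effects structure.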
plot(residuals(mod_rint) ~ fitted(mod_rint))
abline(h = 0)

plot(residuals(mod_rsint2) ~ fitted(mod_rsint2))
abline(h = 0)

# Residual plot looks problematic - check omitted variable bias and include Exposure in model

mod_rsint1 <- lmer(Richness ~ NAP + (1 + NAP|Beach), data, REML = F)
## boundary (singular) fit: see help('isSingular')
mod_rsint2 <- lmer(Richness ~ NAP * Exposure + (1 + NAP|Beach), data, REML = F)
## boundary (singular) fit: see help('isSingular')
anova(mod_rsint1, mod_rsint2)
           npar      AIC     BIC    logLik -2*log(L)    Chisq Df Pr(>Chisq)
mod_rsint1    6 246.6561 257.496 -117.3280  234.6561       NA NA         NA
mod_rsint2   10 238.1993 256.266 -109.0997  218.1993 16.45673  4  0.0024637
# The interaction fixed effect improves model fit - REML = F must be used when comparing fixed effects

plot(residuals(mod_rsint2) ~ fitted(mod_rsint2))
abline(h = 0)

# Didn't resolve heteroscedasticity issue with residuals - try a count model

mod_glm_rsint <- glmer(Richness ~ NAP * Exposure + (1 + NAP|Beach), data, family = poisson(link = "log"))

plot(residuals(mod_glm_rsint) ~ fitted(mod_glm_rsint))
abline(h = 0)

sims <- simulateResiduals(mod_glm_rsint)
plot(sims, quantreg = T)

check_overdispersion(mod_glm_rsint) # No need for negative binomial distribution
## # Overdispersion test
## 
##        dispersion ratio =  0.895
##   Pearson's Chi-Squared = 32.228
##                 p-value =  0.649
## No overdispersion detected.
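The dispersion ratio reported by check_overdispersion() is the Pearson chi-squared statistic divided by the residual degrees of freedom; a ratio well above 1 would suggest switching to a negative binomial model (values copied from the output above):

```r
# Dispersion ratio by hand
pearson_chisq <- 32.228  # Pearson chi-squared from the output above
df_resid <- 36           # residual degrees of freedom for the Poisson GLMM
ratio <- pearson_chisq / df_resid
round(ratio, 3)  # 0.895 - no evidence of overdispersion
```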
# Selected model steps
# 1) determine random effects structure using REML=T
# 2) determine fixed effects structure using REML=F
# 3) determine appropriate distribution normal, transformation, count, etc.
# 4) move onto assessing Anova results

car::Anova(mod_glm_rsint, type = 3) # cannot call test.statistic = "F" for glmerMod object
                  Chisq Df Pr(>Chisq)
(Intercept)  148.454216  1  0.0000000
NAP            1.316567  1  0.2512092
Exposure      29.587327  2  0.0000004
NAP:Exposure   1.573073  2  0.4554195
car::Anova(mod_glm_rsint, type = 2) 
                  Chisq Df Pr(>Chisq)
NAP           24.231343  1  0.0000009
Exposure      30.890508  2  0.0000002
NAP:Exposure   1.573073  2  0.4554195
summary(mod_glm_rsint)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: Richness ~ NAP * Exposure + (1 + NAP | Beach)
##    Data: data
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##     212.7     228.9     -97.3     194.7        36 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.4268 -0.5756 -0.1351  0.1743  2.0062 
## 
## Random effects:
##  Groups Name        Variance Std.Dev. Corr 
##  Beach  (Intercept) 0.02686  0.1639        
##         NAP         0.05272  0.2296   -0.22
## Number of obs: 45, groups:  Beach, 9
## 
## Fixed effects:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      2.5349     0.2080  12.184  < 2e-16 ***
## NAP             -0.3037     0.2646  -1.147   0.2512    
## Exposure10      -0.5702     0.2447  -2.330   0.0198 *  
## Exposure11      -1.3585     0.2613  -5.200 1.99e-07 ***
## NAP:Exposure10  -0.3969     0.3166  -1.254   0.2100    
## NAP:Exposure11  -0.2685     0.3301  -0.813   0.4160    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) NAP    Exps10 Exps11 NAP:E10
## NAP         -0.151                             
## Exposure10  -0.851  0.128                      
## Exposure11  -0.795  0.121  0.675               
## NAP:Expsr10  0.127 -0.837 -0.124 -0.100        
## NAP:Expsr11  0.122 -0.801 -0.110 -0.117  0.673

1.5.5 Mean Comparisons

Now that the best fitting model has been chosen, we can move forward with interpreting the results.

Exposure_means <- mod_glm_rsint %>%
  emmeans(~ Exposure) %>%
  cld(Letters = letters, type = "response")
## NOTE: Results may be misleading due to involvement in interactions
Exposure_means
  Exposure      rate        SE  df asymp.LCL asymp.UCL .group
3       11  2.657518 0.4496860 Inf  1.907401  3.702631 a
2       10  5.590957 0.7752359 Inf  4.260489  7.336903  b
1        8 11.351010 2.4336616 Inf  7.456526 17.279553   c
trends <- mod_glm_rsint %>%
  emtrends(pairwise ~ Exposure, var = "NAP") %>%
  cld(Letters = letters, type = "response")
trends
  Exposure  NAP.trend        SE  df  asymp.LCL  asymp.UCL .group
2       10 -0.7005764 0.1734339 Inf -1.0405006 -0.3606523  a
3       11 -0.5722070 0.1976221 Inf -0.9595392 -0.1848748  a
1        8 -0.3036633 0.2646494 Inf -0.8223666  0.2150400  a

1.5.6 Data Visualization

We need to produce a trend line for each exposure level. We will cover two approaches, both more involved than the predict function used previously.

summary(data)
##     Richness      Exposure      NAP              Beach         Site  
##  Min.   : 0.000   8 : 5    Min.   :-1.3360   1      : 5   Min.   :1  
##  1st Qu.: 3.000   10:20    1st Qu.:-0.3750   2      : 5   1st Qu.:2  
##  Median : 4.000   11:20    Median : 0.1670   3      : 5   Median :3  
##  Mean   : 5.689            Mean   : 0.3477   4      : 5   Mean   :3  
##  3rd Qu.: 8.000            3rd Qu.: 1.1170   5      : 5   3rd Qu.:4  
##  Max.   :22.000            Max.   : 2.2550   6      : 5   Max.   :5  
##                                              (Other):15
data$Site <- as.factor(data$Site)
new.x <- expand.grid(NAP = seq(-1.336, 2.255, length.out = 100),
                   Exposure = levels(data$Exposure),
                   Beach = levels(data$Beach),
                   Site = levels(data$Site))


new.x$eta <- predict(mod_glm_rsint, new.x, re.form = NA)  # linear predictor (log scale)
new.x$Richness <- exp(new.x$eta)                          # predictions on the response scale

# predict() for glmer models does not return standard errors, so there is no way
# to derive confidence intervals from the predict function directly.
# Instead, some complex matrix multiplication on the fixed-effects
# variance-covariance matrix produces approximate 95% confidence intervals.

mm <- model.matrix(terms(mod_glm_rsint), new.x)
pvar1 <- diag(mm %*% tcrossprod(vcov(mod_glm_rsint), mm)) # variance of the fixed-effect predictions
tvar1 <- pvar1 + VarCorr(mod_glm_rsint)$Beach[1]          # plus the random-intercept variance
cmult <- 2                                                # roughly a 95% interval

# Interval bounds are computed on the log (link) scale and back-transformed
new.x <- new.x %>%
  mutate(plo = exp(eta - cmult * sqrt(pvar1)),
         phi = exp(eta + cmult * sqrt(pvar1)),
         tlo = exp(eta - cmult * sqrt(tvar1)),
         thi = exp(eta + cmult * sqrt(tvar1)))

# Fixed Effects Only
FE_Plot <- ggplot(data, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_point(size = 5) +
  geom_line(data = new.x, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_line(data = new.x, aes(x = NAP, y = plo, color = Exposure), lty = 2) +
  geom_line(data = new.x, aes(x = NAP, y = phi, color = Exposure), lty = 2) +
  scale_color_manual(values = c(`8` = "blue", `10` = "red", `11` = "green")) +
  theme_classic()
FE_Plot

# CI based on FE and RE
FE_RE_Plot <- ggplot(data, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_point(size = 5) +
  geom_line(data = new.x, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_line(data = new.x, aes(x = NAP, y = tlo, color = Exposure), lty = 2) +
  geom_line(data = new.x, aes(x = NAP, y = thi, color = Exposure), lty = 2) +
  scale_color_manual(values = c(`8` = "blue", `10` = "red", `11` = "green")) +
  theme_classic()
FE_RE_Plot

Exposure_means <- Exposure_means %>%
  as_tibble()

Plot_Exposure <- ggplot() +
  geom_point(data = data, aes(y = Richness, x = Exposure), position = position_jitter(width = 0.1)) + # dots representing the raw data
  geom_boxplot(data = data, aes(y = Richness, x = Exposure), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
  geom_point(data = Exposure_means, aes(y = rate, x = Exposure), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
  geom_errorbar(data = Exposure_means, aes(ymin = asymp.LCL, ymax = asymp.UCL, x = Exposure), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
  geom_text(data = Exposure_means, aes(y = rate, x = Exposure, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # black letters indicating the mean groupings
  labs(y = "Richness", x = "Exposure", tag = "C") +
  theme_classic()
Plot_Exposure

Composite_Plot <- gridExtra::grid.arrange(FE_Plot, FE_RE_Plot, Plot_Exposure, nrow = 1)

ggsave("GLMER_Plot.tiff", plot = Composite_Plot, width = 50, height = 20, units = "cm", dpi = 300)

1.5.7 Approach Using bootMer

new_x <- expand.grid(NAP = seq(-1.336, 2.255, length.out = 100),
                   Exposure = levels(data$Exposure),
                   Beach = levels(data$Beach),
                   Site = levels(data$Site))

mySumm <- function(.) {
  predict(., newdata = new_x, re.form = NA)
}
sumBoot <- function(merBoot) {
  return(
    data.frame(fit = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = 0.5, na.rm = TRUE))),
               lwr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = 0.025, na.rm = TRUE))),
               upr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = 0.975, na.rm = TRUE)))
    )
  )
}

boot1 <- lme4::bootMer(mod_glm_rsint, mySumm, nsim = 250, use.u = FALSE, type = "parametric")
PI.boot1 <- sumBoot(boot1)
head(PI.boot1)
     fit      lwr      upr
3.002342 1.848330 3.731325
2.989927 1.856632 3.704538
2.976820 1.864935 3.677143
2.961818 1.873238 3.648247
2.946816 1.880132 3.620555
2.932186 1.886986 3.593678
new_data <- data.frame(new_x, PI.boot1) %>%
  rename(Richness = fit) %>%
  mutate(Richness = exp(Richness),
         lwr = exp(lwr),
         upr = exp(upr))

# Incorporates random-effect uncertainty (use.u = FALSE)
FE_RE_Plot <- ggplot(data, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_line(data = new_data, aes(x = NAP, y = Richness, color = Exposure), linewidth = 2) +
  geom_line(data = new_data, aes(x = NAP, y = lwr, color = Exposure), lty = 2) +
  geom_line(data = new_data, aes(x = NAP, y = upr, color = Exposure), lty = 2) +
  scale_color_manual(values = c(`8` = "blue", `10` = "red", `11` = "green")) +
  theme_classic()
FE_RE_Plot

boot2 <- lme4::bootMer(mod_glm_rsint, mySumm, nsim = 250, use.u = TRUE, type = "parametric")
PI.boot2 <- sumBoot(boot2)

new_data_2 <- data.frame(new_x, PI.boot2) %>%
  rename(Richness = fit) %>%
  mutate(Richness = exp(Richness),
         lwr = exp(lwr),
         upr = exp(upr))

# Conditions on the estimated random effects (use.u = TRUE), so intervals reflect fixed-effect uncertainty only
FE_RE_Plot_2 <- ggplot(data, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_point(size = 5) +
  geom_line(data = new_data_2, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_line(data = new_data_2, aes(x = NAP, y = lwr, color = Exposure), lty = 2) +
  geom_line(data = new_data_2, aes(x = NAP, y = upr, color = Exposure), lty = 2) +
  scale_color_manual(values = c(`8` = "blue", `10` = "red", `11` = "green")) +
  theme_classic()
FE_RE_Plot_2

1.6 Practice Problem: Repeated Measures

This type of analysis is sometimes called a longitudinal study, where repeated observations are taken on individual subjects. In agricultural research, an individual plant may have repeated observations. In this example dataset, eighteen patients participated in a study in which they were allowed only three hours of sleep per night, and their reaction time on a specific test was observed each day. The underlying correlation structure means that observations from the same individual or subject should be more similar than observations from two different individuals or subjects. We begin with an initial review of the dataset and then test a random intercept model.

* Reaction - average reaction time (ms)
* Days - number of days of sleep deprivation
* Subject - subject number on which the observation was made

# Loading the Data
data <- read_csv(here("data","Random_Slope_and_Intercept_Model.csv"))
## Rows: 180 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Reaction, Days, Subject
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## spc_tbl_ [180 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Reaction: num [1:180] 250 259 251 321 357 ...
##  $ Days    : num [1:180] 0 1 2 3 4 5 6 7 8 9 ...
##  $ Subject : num [1:180] 308 308 308 308 308 308 308 308 308 308 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Reaction = col_double(),
##   ..   Days = col_double(),
##   ..   Subject = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
data$Subject <- as.factor(data$Subject)

ggplot(data, aes(y = Reaction, x = Days)) +
    facet_wrap(~ Subject, ncol = 6) + 
    geom_point() + 
    geom_line()

mod_intercept <- lmer(Reaction ~ Days + (1|Subject), data = data)

# Review model assumptions
plot(resid(mod_intercept) ~ fitted(mod_intercept)) # small hint of curved relationship
abline(h = 0)

# Review contribution of each random effect and their predictors (BLUPs)
summary(mod_intercept)$varcor
##  Groups   Name        Std.Dev.
##  Subject  (Intercept) 37.124  
##  Residual             30.991
ranef(mod_intercept)$Subject # unique baseline for each subject
    (Intercept)
308   40.783710
309  -77.849554
310  -63.108567
330    4.406442
331   10.216189
332    8.221238
333   16.500494
334   -2.996981
335  -45.282127
337   72.182686
349  -21.196249
350   14.111363
351   -7.862221
352   36.378425
369    7.036381
370   -6.362703
371   -3.294273
372   18.115747
car::Anova(mod_intercept, test.statistic = "F") # Days is a continuous variable - summary table gives us the slope of the line
            F Df Df.res Pr(>F)
Days 169.4014  1    161      0
summary(mod_intercept)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: data
## 
## REML criterion at convergence: 1786.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2257 -0.5529  0.0109  0.5188  4.2506 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1378.2   37.12   
##  Residual              960.5   30.99   
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 251.4051     9.7467   25.79
## Days         10.4673     0.8042   13.02
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.371

Each subject has their own baseline reaction time, and the subsequent measurements are relative to that baseline, so a random intercept allows each subject to have their own baseline prediction. To visualize how well this model fits the data, we will plot the predicted values: lines whose y-intercepts equal the sum of the fixed intercept and each subject's random intercept. The slope is assumed to be the same for every subject, 10.4673 ms per day.
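As a concrete example, using the fixed intercept from summary() and subject 308's random intercept from ranef() above:

```r
# Subject 308's predicted baseline (Day 0) = fixed intercept + random intercept
251.4051 + 40.783710
## ~292.19 ms; the shared slope of 10.4673 ms/day then applies to every subject
```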

data <- data %>% 
  mutate(yhat = predict(mod_intercept, re.form = ~ (1|Subject))) # predict calculates predictions from the model estimates; re.form includes the random intercepts
ggplot(data, aes(y = Reaction, x = Days)) +
    facet_wrap(~ Subject, ncol = 6) + 
    geom_point() + 
    geom_line() + # original lines from raw data
    geom_line(aes(y = yhat), color = 'red') # predicted lines from yhat values

Some subjects have less deviation from their predicted lines, but this assumes each subject has the same slope. We can fit a model that allows for each subject to have their own slope as well as their own y-intercept. The random slope will be calculated as a fixed effect of slope plus a random offset from that.

mod_SI <- lmer(Reaction ~ Days + (1 + Days|Subject), data = data)

# Review model assumptions
plot(resid(mod_SI) ~ fitted(mod_SI)) # the curved relationship is gone - a few extreme observations remain
abline(h = 0)

# Review contribution of each random effect and their predictors (BLUPs)
summary(mod_SI)$varcor
##  Groups   Name        Std.Dev. Corr 
##  Subject  (Intercept) 24.7407       
##           Days         5.9221  0.066
##  Residual             25.5918
ranef(mod_SI)$Subject # unique baseline and slope for each subject
    (Intercept)        Days
308   2.2585509   9.1989758
309 -40.3987381  -8.6196806
310 -38.9604090  -5.4488565
330  23.6906196  -4.8143503
331  22.2603126  -3.0699116
332   9.0395679  -0.2721770
333  16.8405086  -0.2236361
334  -7.2326151   1.0745816
335  -0.3336684 -10.7521652
337  34.8904868   8.6282652
349 -25.2102286   1.1734322
350 -13.0700342   6.6142178
351   4.5778642  -3.0152621
352  20.8636782   3.5360011
369   3.2754656   0.8722149
370 -25.6129993   4.8224850
371   0.8070461  -0.9881562
372  12.3145921   1.2840221
car::Anova(mod_SI, test.statistic = "F") # Days is a continuous variable - summary table gives us the slope of the line
            F Df Df.res  Pr(>F)
Days 45.85296  1     17 3.3e-06
summary(mod_SI) # estimate for slope (Days) doesn't change much
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 + Days | Subject)
##    Data: data
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4634  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.10   24.741       
##           Days         35.07    5.922   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.825  36.838
## Days          10.467      1.546   6.771
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138
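Each subject's individual regression line is therefore the fixed estimate plus that subject's BLUP. For example, subject 308, using values from the ranef() and summary() outputs above:

```r
intercept_308 <- 251.405 + 2.2585509  # subject-specific intercept
slope_308     <- 10.467 + 9.1989758   # subject-specific slope (ms per day)
c(intercept_308, slope_308)
## ~253.66 and ~19.67 - subject 308 deteriorates much faster than average
```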

Review the figure below, showing each subject with their unique slope and intercept.

data <- data %>% 
  mutate(yhat = predict(mod_SI, re.form = ~ (1 + Days|Subject)))
ggplot(data, aes(y = Reaction, x = Days)) +
    facet_wrap(~ Subject, ncol = 6) + 
    geom_point() + 
    geom_line() +
    geom_line(aes(y = yhat), color = 'red')

An eyeball test suggests the random slope and intercept prediction lines fit the data better. We can employ a formal test to compare the fit of the two models (random intercept vs. random slope and intercept).

anova(mod_intercept, mod_SI)
## refitting model(s) with ML (instead of REML)
              npar      AIC      BIC    logLik -2*log(L)   Chisq Df Pr(>Chisq)
mod_intercept    4 1802.079 1814.850 -897.0393  1794.079      NA NA         NA
mod_SI           6 1763.939 1783.097 -875.9697  1751.939 42.1393  2          0

The lower AIC and BIC values and the higher (less negative) log-likelihood tell us that the random slope and intercept model is a better model than the random intercept model alone.
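The quantities in the anova() table can be reproduced by hand: the likelihood-ratio statistic is the difference between the two deviances (-2*log(L)), its degrees of freedom is the difference in parameter counts, and each AIC is the deviance plus twice the number of parameters:

```r
chisq  <- 1794.079 - 1751.939              # 42.14, on 6 - 4 = 2 df
aic_SI <- 1751.939 + 2 * 6                 # 1763.939, matching the table
pchisq(chisq, df = 2, lower.tail = FALSE)  # effectively zero
```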

We can review the results from the better fitting model with first the population estimate for the relationship between Days and Reaction time. Then the estimates for each subject.

data <- data %>% 
  mutate(yhat = predict(mod_SI, re.form = ~ 0))
ggplot(data, aes(x = Days, y = yhat)) +
  geom_point(aes(x = Days, y = Reaction)) +
  geom_line(color = 'red') + ylab('Reaction') +
  ggtitle('Population Estimated Regression Curve') +
  scale_x_continuous(breaks = seq(0, 9, by = 2))

data <- data %>% 
  mutate(yhat.ind = predict(mod_SI, re.form = ~ (1 + Days|Subject)))
ggplot(data, aes(x = Days)) +
  geom_line(aes(y = yhat), linewidth = 3) + 
  geom_line(aes(y = yhat.ind, group = Subject), color = 'red') +
  scale_x_continuous(breaks = seq(0, 9, by = 2)) +
  ylab('Reaction') + ggtitle('Person-to-Person Variation')

The final step in producing a figure that explains the relationship is to incorporate confidence intervals around the predicted relationship. A familiar approach to producing confidence intervals around the prediction line uses the predict function again. This approach makes it easy to include other variables from more complex models; those terms can be added to the expand.grid call.

# Find the range of the Days values
min.days <- min(data$Days)
max.days <- max(data$Days)
# New x data
new.x <- expand.grid(Days = seq(min.days, max.days, length = 1000), Subject = levels(data$Subject))
# Generate fits and standard errors at new.x values
new.y <- predict(mod_SI, newdata = new.x, se.fit = TRUE, re.form = NA)
new.y <- data.frame(new.y)
# housekeeping to put new.x and new.y together
addThese <- data.frame(new.x, new.y)
addThese <- rename(addThese, Reaction = fit)
# Add confidence intervals
addThese <- mutate(addThese, lwr = Reaction - 1.96 * se.fit,
                             upr = Reaction + 1.96 * se.fit)
# See how the confidence intervals match the raw data
ggplot(data, aes(x = Days, y = Reaction)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(data = addThese, aes(ymin = lwr, ymax = upr), stat = "identity") +
  theme_classic()

Unfortunately, there isn’t a straightforward way to incorporate the uncertainty of the variance components. Instead we have to rely on bootstrapping techniques (an iterative process) to produce these quantities. This can be achieved with the bootMer function in the lme4 package, following these steps:

  1. Generate a bootstrap sample
  2. Fit a model to that bootstrap sample
  3. From that model, calculate some statistic(s) you care about
  4. Repeat steps 1-3 many times to generate a bootstrap distribution of the statistics you care about
  5. From the bootstrap distribution generate confidence intervals for the value of interest
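Before turning to bootMer, these steps can be sketched as a plain loop. The toy version below resamples rows of a simple linear model (non-parametrically, and not the mixed model) just to make the recipe concrete:

```r
set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20, sd = 2)      # toy data with a known slope of 3
boot_slopes <- replicate(1000, {        # step 4: repeat many times
  star <- sample(20, replace = TRUE)    # step 1: generate a bootstrap sample
  m <- lm(y[star] ~ x[star])            # step 2: fit a model to that sample
  coef(m)[2]                            # step 3: the statistic we care about
})
quantile(boot_slopes, c(0.025, 0.975))  # step 5: percentile confidence interval
```

bootMer automates exactly this loop, but generates the bootstrap samples parametrically from the fitted mixed model.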

First we need to extract some details from the fitted model.

summary(mod_SI)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 + Days | Subject)
##    Data: data
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4634  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.10   24.741       
##           Days         35.07    5.922   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.825  36.838
## Days          10.467      1.546   6.771
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138
# Estimated intercept of 251.405
# Estimated slope of 10.467
# Subject intercept and slope random effect assumed to be normally distributed centered at zero and with estimated standard deviations of 24.741 and 5.922 (respectively)
# For a given subject's regression line, observations are just normal (mean zero, standard deviation 25.592) changes from the line.

# Creating bootstrap data
subject.intercept = 251.405 + rnorm(1, mean = 0, sd = 24.741)
subject.slope = 10.467 + rnorm(1, mean = 0, sd = 5.922)
c(subject.intercept, subject.slope)
## [1] 283.70730  12.78199
subject.obs <- data.frame(Days = 0:9) %>%
  mutate(Reaction = subject.intercept + subject.slope * Days + rnorm(10, sd=25.592))

ggplot(subject.obs, aes(x = Days, y = Reaction)) +
  geom_point()

This approach is commonly referred to as a “parametric” bootstrap because we make assumptions about the parameter distributions, whereas a “non-parametric” bootstrap makes no distributional assumptions. By default, the bootMer function will

  1. Perform a parametric bootstrap to create new datasets, using the results of the initial model
  2. Create a bootstrap model by analyzing the bootstrap data using the same model formula used by the initial model
  3. Apply some function you write to each bootstrap model - this function takes in a bootstrap model and returns a statistic or vector of statistics
  4. Repeat steps 1-3 to create the bootstrap distribution of the statistics returned by your function in step 3
ConfData <- data.frame(Days = 0:9)   # What x-values I care about - much more complicated process when you have multiple variables in the model - need to create a matrix of those variables with a chosen value (e.g., mean or median)

# Get our best guess as to the relationship between day and reaction time
ConfData <- ConfData %>%
  mutate( Estimate = predict(mod_SI, newdata = ConfData, re.form = ~ 0))

# A function to generate yhat from a model
myStats <- function(model.star){
  out <- predict(model.star, newdata = ConfData, re.form = ~ 0)
  return(out)
}

# bootMer generates new data sets, calls lmer on each to produce a model,
# and then calls whatever function we pass in. It repeats this `nsim` times.
bootObj <- lme4::bootMer(mod_SI, FUN = myStats, nsim = 1000 )

# Unfortunately, it doesn't allow for the bias-corrected and accelerated method, but the percentile method is fine for visualization.
hist(bootObj, ci = 'perc')

# Add the confidence interval values onto estimates data frame
CI <- confint(bootObj, level = 0.95, ci = 'bca') # Here we can use the bias-corrected and accelerated option
colnames(CI) <- c('lwr', 'upr') 
ConfData <- cbind(ConfData, CI)

ConfData %>%
  ggplot(aes(x = Days)) +
  geom_line(aes(y = Estimate), color = 'red') +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = 'salmon', alpha = 0.2)

The prediction interval covers the range of plausible observed values, and we can visualize it alongside the confidence interval. The simulate function creates the bootstrap dataset without sending it on for further processing; it returns simulated response values organized so that they can be appended to the original dataset.

# Set up the structure of new subjects
PredData <- data.frame(Subject = 'new', Days = 0:9) # simulate a NEW patient

# Create a n x 1000 data frame
Simulated <- simulate(mod_SI, newdata = PredData, allow.new.levels = TRUE, nsim = 1000)
 
# squish the Subject/Day info together with the simulated and then grab the quantiles
# for each day
PredIntervals <- cbind(PredData, Simulated) %>%
  gather('sim', 'Reaction', sim_1:sim_1000 ) %>%   # go from wide to long structure
  group_by(Subject, Days) %>%
  summarize(lwr = quantile(Reaction, probs = 0.025),
            upr = quantile(Reaction, probs = 0.975))
## `summarise()` has grouped output by 'Subject'. You can override using the
## `.groups` argument.
# Plot the prediction and confidence intervals
ggplot(ConfData, aes(x = Days)) +
  geom_point(data = data, aes(x = Days, y = Reaction)) +
  geom_line(aes(y = Estimate), color = 'red') +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = 'salmon', alpha = 0.2) +
  geom_ribbon(data = PredIntervals, aes(ymin = lwr, ymax = upr), fill = 'blue', alpha = 0.2)